University of Sydney’s Data Students Analysis with Pandas and R
Recommendation/Insight
This report explores academic and socioeconomic trends among domestic and international students in DATA1X01. In general, international students appear to be more interested in data science and to pay higher rent than domestic students, possibly because they tend to live closer to Sydney Central. These findings may help inform future campus policy improvements.
Evidence
IDA
Code
# Loading the necessary libraries
library(tidyverse)
library(plotly)
library(gganimate)
library(kableExtra)
library(dplyr)
library(gghighlight)

# Reading the datasets
Data1001Survey = read.csv("G:/YEAR 1/SEMESTER 1/DATA1001/Coding Stuff/Project 1/Data1001Survey_Cleaned.csv")
Survey_uncleaned = read.csv("G:/YEAR 1/SEMESTER 1/DATA1001/Coding Stuff/Project 1/data1001_survey_data.csv")
Source
We used the DATA1X01 census data provided on the Canvas page. It was collected from a survey that DATA1X01 students completed in Week 2.
Structure
Code
str(Survey_uncleaned)
'data.frame': 533 obs. of 30 variables:
$ Progress : int 100 100 100 100 100 100 100 100 100 100 ...
$ consent : chr "I consent to take part in the study" "I consent to take part in the study" "I consent to take part in the study" "I consent to take part in the study" ...
$ age : int 20 18 20 19 18 20 18 21 19 20 ...
$ gender : chr "Female" "Female" "Female" "Male" ...
$ country_of_birth : chr "Other Please Specify" "Other Please Specify" "China" "Australia" ...
$ country_of_birth._5_TEXT: chr "Singapore" "Indonesia" "" "" ...
$ hours_work : num 0 0 160 1 0 0 15 21 0 0 ...
$ social_media_use : num 10.15 3.47 1.12 23.99 1 ...
$ rent : num 378 549 0 0 300 0 0 860 650 0 ...
$ friends_count : int 5 15 2 0 6 3 15 5 5 0 ...
$ si_1 : int 7 2 5 0 3 6 7 5 0 5 ...
$ highest_speed : num 100 120 130 200 120 160 120 110 120 60 ...
$ relationship_status : chr "Single" "Single" "Its complicated" "Married" ...
$ dates : int 3 0 2 50 0 3 0 4 0 1 ...
$ standard_drinks : num 1 0 0 8 4 12 0 0 12 1 ...
$ countries : int 8 15 10 0 4 4 30 10 8 4 ...
$ drug_use_q : chr "Have you ever used recreational drugs" "Have you ever used recreational drugs" "Have you ever gotten high off illicit drugs?" "Have you ever used recreational drugs" ...
$ drug_usea : chr "No" "No" "" "Yes" ...
$ drug_useb : chr "" "" "No" "" ...
$ drug_use_ans : chr "No" "No" "No" "Yes" ...
$ student_type : chr "International" "International" "Domestic" "Domestic" ...
$ mainstream_advanced : chr "DATA1001" "DATA1901" "DATA1001" "DATA1901" ...
$ semesters : num 1 0 4 20 0 4 0 4 3 0 ...
$ commute : num 3 10 75 1 90 9 20 45 20 20 ...
$ data_interest._1 : int 3 7 4 0 7 7 5 7 2 5 ...
$ mark_goal : int 80 70 85 100 85 95 80 95 90 80 ...
$ hours_studying : int 2 5 5 60 4 4 3 8 2 7 ...
$ lecture_mode : chr "Lecture Slides" "I" "Lecture Slides" "Lecture Slides" ...
$ study_type : chr "I leave things to the last minute" "I work steadily all semester" "I work steadily all semester" "I leave things to the last minute" ...
$ learner_style : chr "Style 2" "Style 3" "Style 3" "Style 3" ...
Before cleaning, the dataset had 533 entries and 30 variables (16 numerical, 14 categorical). We then identified the key variables.
RQ1 Key Variables:
student_type (categorical, either domestic or international)
data_interest (numerical, a 1-10 scale)
RQ2 Key Variables:
student_type
commute (numerical, length of commute in minutes)
rent (numerical, rent per week in AUD)
Limitations
Some students submitted “unrealistic” data, like claiming to pay $6000 per week on rent.
Code
ggplot(Survey_uncleaned) +
  geom_point(aes(x = commute, y = rent)) +
  labs(x = "Commute Time", y = 'Rent') +
  gghighlight(rent > 6000) +
  geom_label(
    label = "Big Outlier",
    x = 40,
    y = 7000,
    label.size = 0.35,
    color = "white",
    fill = "#ff0000"
  )
The survey does not specify which mode of transport the commute time refers to, which may create outliers.
More data, like family income, is needed to fully assess the linear correlation.
Assumptions
Any “unrealistic” data likely results from students misunderstanding a question or mistyping an answer.
Most students reported their commute time on foot, as it is the most common mode of transport.
There should be no significant bias.
Data Cleaning
Code
# Python (pandas) code used for cleaning, kept here as comments:
#
# import pandas as pd
#
# df = pd.read_csv("G:/YEAR 1/SEMESTER 1/DATA1001/Coding Stuff/Datasets/Project 1/data1001_survey_data.csv")
# df.columns = df.columns.str.strip()
#
# # Cleaning up the "Country_of_birth" column
# df.rename(columns={"country_of_birth _5_TEXT": "COB", "si_1": "stress_level"}, inplace=True)
# df["COB"] = df["COB"].fillna(df["country_of_birth"])
# df["COB"] = df["COB"].replace({
#     "Viet Nam": "Vietnam",
#     "Republic Of Korea": "South Korea",
#     "Republic Of China (Taiwan)": "Taiwan",
#     "Hong Kong Sar": "Hong Kong",
#     "Nz": "New Zealand",
#     "Uk": "United Kingdom",
#     "Uae": "United Arab Emirates",
#     "Việt Nam": "Vietnam",
#     "Korea": "South Korea",
#     "England": "United Kingdom",
# })
#
# # General cleaning, such as dropping duplicates, removing "unrealistic" rows, etc.
# df = df.drop_duplicates()
# df = df.drop(index=[163, 85])
# df = df.drop(index=df[df["COB"] == "Other Please Specify"].index)
# df = df.drop(index=df[df["rent"] > 5000].index)
#
# # Dropping unrelated/valueless columns
# df = df.drop(columns=["Progress", "country_of_birth", "consent"])
#
# # Dropping all rows with blank value(s) in numeric columns
# numeric_columns = df.select_dtypes(include=["number"]).columns
# df = df.dropna(subset=numeric_columns)
#
# # Resetting the index after cleaning
# df.index.name = "student_no"
# df = df.reset_index()
# df.drop(columns="student_no", inplace=True)
Removed duplicate rows and rows with blank values in numeric columns for data uniformity.
Removed rows with a ‘rent’ value higher than 5000.
Removed extraneous columns, like “consent”, after filtering.
Renamed some columns for better coherence.
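The cleaning steps listed above can also be sketched in plain Python on a handful of made-up rows, which makes the order of the filters easier to follow. The sample rows, the `clean` helper, and the constant names below are illustrative assumptions and not part of the original pipeline; only the rent threshold of 5000 and the dropped `consent` column come from the report.

```python
# Made-up rows for illustration only; the real survey has 30 columns.
rows = [
    {"student_type": "Domestic", "rent": 0.0, "commute": 75.0, "consent": "yes"},
    {"student_type": "Domestic", "rent": 0.0, "commute": 75.0, "consent": "yes"},  # exact duplicate
    {"student_type": "International", "rent": 6500.0, "commute": 10.0, "consent": "yes"},  # unrealistic rent
    {"student_type": "International", "rent": 549.0, "commute": None, "consent": "yes"},  # blank numeric value
    {"student_type": "International", "rent": 378.0, "commute": 3.0, "consent": "yes"},
]

NUMERIC_COLUMNS = ("rent", "commute")  # hypothetical subset of the numeric columns
DROP_COLUMNS = ("consent",)            # columns no longer needed after filtering
RENT_LIMIT = 5000                      # threshold used in the report


def clean(rows):
    """Drop duplicates, rows with blanks, and unrealistic rents; then drop columns."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(row.items())  # hashable fingerprint used to detect exact duplicates
        if key in seen:
            continue
        seen.add(key)
        if any(row[col] is None for col in NUMERIC_COLUMNS):
            continue  # blank value in a numeric column
        if row["rent"] > RENT_LIMIT:
            continue  # unrealistic rent
        cleaned.append({k: v for k, v in row.items() if k not in DROP_COLUMNS})
    return cleaned


cleaned = clean(rows)
for row in cleaned:
    print(row)
```

The blank-value check runs before the rent check here so that a missing rent is never compared with a number; on a pandas data frame the order matters less because comparisons with missing values simply evaluate to false.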
Findings
Research Question 1
Does a student’s type of enrolment (domestic or international) closely relate to their data interest?
Code
# Creating the boxplot
ggplot(Data1001Survey, aes(x = data_interest._1, fill = student_type)) +
  geom_boxplot() +
  theme_classic() +
  scale_fill_brewer(palette = "Dark2") +
  labs(
    x = "Data interest (1-10)",
    title = "Enrolment Type vs Data Interest for DATA1X01 Students at USYD",
    caption = "Presented by the Unknown Group",
    fill = "Student Type"
  )
Code
# Calculate Q1, Q3, and median for both student groups
summary = summarise(
  group_by(Data1001Survey, student_type),
  Q1 = fivenum(data_interest._1)[2],
  Q3 = fivenum(data_interest._1)[4],
  Median = median(data_interest._1)
)

# Making the table
kbl(summary, caption = "Domestic vs International Data Interest Number Comparison") %>%
  kable_classic(full_width = T, html_font = "spectral")
Domestic vs International Data Interest Number Comparison
| student_type  | Q1 | Q3 | Median |
|---------------|----|----|--------|
| Domestic      | 3  | 6  | 5      |
| International | 5  | 8  | 6      |
From the boxplot and table, we hypothesize that international students are more interested in data than domestic students. For international students, the 25th percentile (5), the 75th percentile (8), and the median (6) are all higher than the corresponding values for domestic students (3, 6, and 5 respectively). The IQR for both groups is 3, which suggests that the two groups have a similar spread and that neither is highly variable.
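For readers who want to check the quartile logic by hand, the sketch below computes Q1, the median, Q3, and the IQR using Tukey's hinges, which is the definition R's `fivenum()` uses. The `tukey_hinges` helper and the two lists of scores are made up for illustration and are not the survey data.

```python
from statistics import median


def tukey_hinges(values):
    """Return (Q1, median, Q3) using Tukey's hinges, as R's fivenum() does."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2  # each half includes the middle value when n is odd
    return median(xs[:half]), median(xs), median(xs[n - half:])


# Hypothetical data-interest scores, not taken from the survey.
scores = {
    "Domestic": [1, 3, 4, 5, 6, 7, 9],
    "International": [4, 5, 6, 6, 7, 8, 8, 10],
}

summary = {}
for group, values in scores.items():
    q1, med, q3 = tukey_hinges(values)
    summary[group] = {"Q1": q1, "Median": med, "Q3": q3, "IQR": q3 - q1}
    print(group, summary[group])
```

Hinges can differ slightly from other quantile definitions on small samples, so figures from a different tool may not match the table exactly.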
The trend can be explained in certain ways:
Australia, with its elite education system and rapidly growing data industry with strong demand, attracts many young data enthusiasts.
Most internationals come from Asia, where ultra-disciplined teaching systems often overlook data science. Thus, they may pursue data science due to genuine interest developed from childhood, especially considering big data’s popularity in Asia (Cornelli et al., 2021).
International students usually predetermine their majors before university, making emerging subjects like data science especially popular. Meanwhile, domestic students have more freedom to explore diverse disciplines, which leads to more dispersed interests.
Research Question 2 (Linear Model)
Is the linear correlation between students’ weekly rent and commute time to university impacted by their enrolment type?
Code
# Create initial scatterplot and linear regression line
s_p = ggplot(Data1001Survey, aes(x = commute, y = rent)) +
  geom_point(aes(color = student_type)) +
  geom_smooth(method = lm, se = F) +
  scale_color_brewer(palette = 'Dark2') +
  labs(
    title = "Commute Time vs Rent for {closest_state}",
    caption = "Presented by the Unknown Group",
    x = "Commute Time (minutes)",
    y = "Rent (AUD p/w)"
  ) +
  theme(legend.position = "none") # Remove legend

# Add animations
s_p +
  transition_states(student_type, transition_length = 2, state_length = 1) +
  enter_fade() +
  exit_fade()
Correlation coefficient calculation result
Code
# Calculating the overall correlation between commute time and rent
cor(x = Data1001Survey$commute, y = Data1001Survey$rent, use = "complete.obs")
[1] -0.3912924
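As a rough illustration of what the correlation call computes, the sketch below implements Pearson's r directly and skips any pair with a missing value, which mirrors the intent of `use = "complete.obs"`. The `pearson_r` helper and the sample values are assumptions made for this example and are not the survey data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


# Hypothetical commute times (minutes) and weekly rents (AUD); None marks a blank answer.
commute = [5, 10, 20, None, 45, 60, 90]
rent = [650, 500, 550, 400, 300, None, 0]

r = pearson_r(commute, rent)
print(round(r, 3))
```

A value near -1 or 1 indicates a strong linear relationship, while a value near 0 indicates a weak one.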
Residual plots
Code
# Create dataframes specifically for international and domestic students
dom = Data1001Survey[Data1001Survey$student_type == "Domestic", ]
intl = Data1001Survey[Data1001Survey$student_type == "International", ]

# Fit regression models
model_dom = lm(rent ~ commute, data = dom)
model_intl = lm(rent ~ commute, data = intl)

# Create residual plots for each student group
intl_plot = ggplot(model_intl, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "red") +
  labs(title = "International", x = 'Fitted', y = 'Residuals')

# Makes the plot interactive
ggplotly(intl_plot)
The graphs indicate a weak-to-moderate negative linear relationship: in both groups, rent decreases as commute time increases. Given the correlation coefficient (r ≈ -0.39) and the relatively random scatter in the residual plots, a linear model is a reasonable choice for this analysis.
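The residual check described above can be sketched numerically: fit a least-squares line, subtract the fitted values from the observed rents, and look at what is left. The `fit_line` helper and the sample values are assumptions made for this example and are not the survey data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical commute times (minutes) and weekly rents (AUD).
commute = [5, 10, 20, 45, 90]
rent = [650, 500, 550, 300, 0]

intercept, slope = fit_line(commute, rent)
fitted = [intercept + slope * x for x in commute]
residuals = [y - f for y, f in zip(rent, fitted)]

for f, e in zip(fitted, residuals):
    print(f"fitted={f:8.1f}  residual={e:8.1f}")
```

Least-squares residuals always sum to zero when the model has an intercept, so the useful diagnostic is their pattern against the fitted values: a curve or a funnel shape would suggest that a straight line is a poor description of the data.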
The scatterplots also suggest that most students prefer renting closer to university - a common trend worldwide.
Interestingly, the international students’ regression line has a much higher intercept than the domestic students’ line but a similar slope, suggesting that international students choose pricier rentals for the same commute time. There are a few possible reasons for this:
Many domestic students live with their parents and pay little to no rent.
Internationals may view expensive rentals as worthy investments environment-wise, safety-wise, and entertainment-wise, hence the popularity of CBD rentals (Soong & Mu, 2024).
International students, unfamiliar with the local rental market, may overpay – a problem domestic students can easily avoid.
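The comparison of intercepts and slopes can be illustrated by fitting one line per group and reading off the predicted rent gap at a fixed commute time. The `fit_line` helper, the group values, and the 30-minute reference commute are all assumptions made for this sketch and are not results from the survey.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical commute times (minutes) and weekly rents (AUD) per group.
groups = {
    "Domestic": {"commute": [10, 30, 60, 90], "rent": [330, 290, 200, 120]},
    "International": {"commute": [10, 30, 60, 90], "rent": [630, 570, 510, 400]},
}

fits = {}
for name, data in groups.items():
    intercept, slope = fit_line(data["commute"], data["rent"])
    fits[name] = {"intercept": intercept, "slope": slope}
    print(f"{name}: intercept={intercept:.1f}, slope={slope:.2f}")

# Predicted rent gap between the groups at the same, arbitrarily chosen commute time.
reference_commute = 30
predicted = {
    name: fit["intercept"] + fit["slope"] * reference_commute
    for name, fit in fits.items()
}
gap = predicted["International"] - predicted["Domestic"]
print(f"Predicted gap at {reference_commute} minutes: {gap:.1f} AUD per week")
```

When the slopes are close, the gap stays roughly constant across commute times, which is what a higher intercept with a similar slope means in practice.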
We expected a stronger negative correlation between commute time and rent, as seen in Australia (Troy et al., 2019) and globally. A larger sample would therefore be needed to model this relationship more reliably.
Declaration on Professional Ethics
Shared Professional Values: Respect
The data used in this project excluded non-consenting participants out of respect for the privacy of survey participants, avoiding a potential breach of privacy policy.
Maintaining Confidence in Statistics
Provided concrete figures, realistic hypotheses, and listed possible limitations of the analysis to properly and truthfully inform the readers of the research result.
Cleaned data to exclude the outliers or irrelevant values, unify data and improve coherence.
Acknowledged the limitations.
Highlighted areas for possible improvements, aiding future research.
Cornelli, G., Doerr, S., Gambacorta, L., & Tissot, B. (2021). Big data in Asian central banks. Asian Economic Policy Review, 17(2), 255-269. https://doi.org/10.1111/aepr.12376
Troy, L., van den Nouwelant, R., & Randolph, B. (2019). Estimating need and costs of social and affordable housing delivery (pp. 1–20). City Futures Research Centre. https://cityfutures.ada.unsw.edu.au/documents/522/Modelling_costs_of_housing_provision_FINAL.pdf